Please we need to discuss project motivation and our experiment, just as Rida asked
Then we can begin our analysis
We chose this dataset from the UCI’s machine learning repository for its categorical predictive attributes. It contains 1994 Census data pulled from the US Census database. The prediction task we’ve set forth is to predict if a person salary range is >50k in a 1994, based on the various categorical/numerical attributes in the census database. The link to the data source is below:
https://archive.ics.uci.edu/ml/datasets/census+income
The effectiveness of our algorithm will be determined by support, confidence and lift. As these are the metrics that describe how strong a relationship between each element is with the other elements within each transaction. Currently, there are no methods for cross validation of association rules, although there are some hard working individuals out there that are attempting to create such a tool.
Here we will discuss each attribute and give some description about its ranges.
First, lets go ahead and load up necessary libraries:
Loaded Packages:
arules, arulesViz, forcats, dplyr, plotly, data.table,
pander, knitr, skimr, lubridate, ggplot2, cowplot,
foreach, doParallel
Next, lets import our dataset
data <- read.csv("./data/adult-training.csv")
The first thing we must do is check and see if there are any NAs in our dataset, just to make sure to not mess up our analysis.
NA_sum <- sort(sapply(data, function(x) sum(is.na(x))), decreasing = TRUE)
data.frame((NA_sum))
Looks like we are doing ok here. The next issue we have in the dataset, is because of the way the csv was stored, some of the levels in our factors include leading and trailing whitespace. This is highly undesirable, so we must clean it up:
GetFactors <- function(df) {
return(names(Filter(is.factor, df)))
}
FixLevels <- function(x) {
levels(x) <- trimws(levels(x))
return(x)
}
data[GetFactors(data)] <- lapply(data[GetFactors(data)], FixLevels)
pander(lapply(data[GetFactors(data)], levels))
Next, we need to reencode our data as factors. First, lets encode the education levels into factors with larger groups (for example 1st-12th grade should be no diploma, not a bunch of levels).
data$education <- fct_collapse(data$education, `No Diploma` = c("1st-4th", "5th-6th",
"7th-8th", "9th", "10th", "11th", "12th", "Preschool"), Associates = c("Assoc-acdm",
"Assoc-voc"), Diploma = c("Some-college", "HS-grad"))
Then the the income brackets:
data$income_bracket <- fct_collapse(data$income_bracket, small = "<=50K", large = ">50K")
Next, lets change the ? levels to something more useful:
levels(data$workclass)[levels(data$workclass) == "?"] <- "Other"
levels(data$occupation)[levels(data$occupation) == "?"] <- "Other-service"
levels(data$native_country)[levels(data$native_country) == "?"] <- "Other"
data[GetFactors(data)] <- lapply(data[GetFactors(data)], FixLevels)
pander(lapply(data[GetFactors(data)], levels))
Next, lets remove the fnlwgt, education number, and capital gain and loss columns, as they are unneeded. We also need to rename some columns to be easier for us, and use the cut function to factorize our numeric variables
data <- data[, -c(3, 5, 11:12)]
data$age <- cut(data$age, breaks = c(15, 25, 45, 65, 100), labels = c("Young",
"Middleaged", "Senior", "Retired"))
data$hours_per_week <- cut(data$hours_per_week, breaks = c(0, 20, 40, 60, 80),
labels = c("part-time", "full-time", "hard-working", "need-a-life"))
library(magrittr)
data %>% skim_to_list() %$% factor %>% select(-c("ordered", "missing", "complete")) %>%
kable
| variable | n | n_unique | top_counts |
|---|---|---|---|
| age | 32561 | 4 | Mid: 16523, Sen: 8469, You: 6411, Ret: 1158 |
| education | 32561 | 7 | Dip: 17792, Bac: 5355, No : 4253, Ass: 2449 |
| gender | 32561 | 2 | Mal: 21790, Fem: 10771, NA: 0 |
| hours_per_week | 32561 | 4 | ful: 20052, har: 8471, par: 2928, nee: 902 |
| income_bracket | 32561 | 2 | sma: 24720, lar: 7841, NA: 0 |
| marital_status | 32561 | 7 | Mar: 14976, Nev: 10683, Div: 4443, Sep: 1025 |
| native_country | 32561 | 42 | Uni: 29170, Mex: 643, Oth: 583, Phi: 198 |
| occupation | 32561 | 14 | Oth: 5138, Pro: 4140, Cra: 4099, Exe: 4066 |
| race | 32561 | 5 | Whi: 27816, Bla: 3124, Asi: 1039, Ame: 311 |
| relationship | 32561 | 6 | Hus: 13193, Not: 8305, Own: 5068, Unm: 3446 |
| workclass | 32561 | 9 | Pri: 22696, Sel: 2541, Loc: 2093, Oth: 1836 |
Lets see the results:
We’d also like to get a quick feel for the dataset through some visulizations.
p1 <- ggplot(data) + geom_bar(aes(x = age, fill = income_bracket), stat = "count") +
labs(x = "Age", y = "Count", title = "Age by Income", subtitle = "Histogram plot")
p1
Our first plot shows us a quick histogram of age groups and their income group. Not surprisingly, the majority of the “large” income group is in middleaged and senior groups supporting the common knowledge that as you get older, you should be making more money.
p2 <- ggplot(data, aes(x = education, fill = income_bracket, color = income_bracket)) +
geom_bar(alpha = 0.9, position = "fill") + coord_flip() + labs(x = "Education",
y = "Proportion", title = "Income bias based on Education", subtitle = "Stacked bar plot")
p2
Our next plot is a stacked bar chart of income group by density of education. We see here that higher education (Bachelors, Masters, Doctorate) yields a higher proportions of the “large” income bracket. Professional school also about equal with Docotorate suggesting specilaized trades and higher education are the best bet for making over 50k a year.
p3 <- ggplot(data, aes(x = marital_status, fill = income_bracket, color = income_bracket)) +
geom_bar(alpha = 0.9, position = "fill") + coord_flip() + labs(x = "Marital Status",
y = "Proportion", title = "Income bias based on Marital status", subtitle = "Stacked bar plot")
p3
Using the similar plot style as above, we’ll now look at income group by density of Marital Status. Which too no surprise to us researchers, married couples have a distinct advantage over thoes that are unmarried. In fact the next highest categories are Widowed and Divorced. Which means, if you want to make more money in life, you should at least give marriage a try.
p4 <- ggplot(data, aes(x = occupation, fill = income_bracket, color = income_bracket)) +
geom_bar(alpha = 0.9, position = "fill") + coord_flip() + labs(x = "Occupation Status",
y = "Proportion", title = "Income bias based on Occupation status", subtitle = "Stacked bar plot")
p4
Next, we’ll compare income groups by density of Occupation. This revealed a few obvious, but also a few interesting results. Exec-managerial and Prof-specialty were the highest categories to no one’s surprise. Protective-serv, transport-moving and tech-support also show good density for making over 50k. Some of the lowest categories thought were Armed-Forces, and house-cleaning/services.
p6 <- ggplot(data, aes(occupation)) + geom_bar(aes(fill = education), width = 0.5) +
theme(axis.text.x = element_text(angle = 60, vjust = 0.5)) + labs(title = "Histogram of occupation with education binning",
subtitle = "Occupation and Educational")
p6
Lastly, we combine education and ocupation to see if we can notice any particular relationships between the two groups. What kind of educations to the people with large incomes have? Exec-managerial show a large amount of bachelor’s and masters degree’s, which means, more school definitely helps in that occupation. Prof-specialty shows the best range of educational sectors respresented but we found it interesting that prof-school wasn’t a larger representation here. This is probably due to prof-specialty being a wider category than some in that you can be a professional in alot of different occupational sectors.
Armed-forces have an extremely low amount of education information which also makes sense as most are recruited right out of high school. We would have expected to see a higher amount of Diploma’s given out in this group, but perhaps the information is just missing from this dataset.
Finally, we can set up our dataset as a transactional dataset to be in the proper data format for the Apriori algorithm:
data <- as(data, "transactions")
summary(data)
#> transactions as itemMatrix in sparse format with
#> 32561 rows (elements/itemsets/transactions) and
#> 102 columns (items) and a density of 0.1077805
#>
#> most frequent items:
#> native_country=United-States race=White
#> 29170 27816
#> income_bracket=small workclass=Private
#> 24720 22696
#> gender=Male (Other)
#> 21790 231771
#>
#> element (itemset/transaction) length distribution:
#> sizes
#> 10 11
#> 208 32353
#>
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 10.00 11.00 11.00 10.99 11.00 11.00
#>
#> includes extended item information - examples:
#> labels variables levels
#> 1 age=Young age Young
#> 2 age=Middleaged age Middleaged
#> 3 age=Senior age Senior
#>
#> includes extended transaction information - examples:
#> transactionID
#> 1 1
#> 2 2
#> 3 3
Option B: Association Rule Mining • Create frequent itemsets and association rules. • Use tables/visualization to discuss the found results. • Use several measure for evaluating how interesting different rules are. • Describe your results. What findings are the most compelling and why?
Before we begin with our analysis, lets check out the rule frequencies within the dataset. We are looking for rules with support >= .2
itemFrequencyPlot(data, support = 0.2)
Next, lets mine some rules with the apriori algorithm, and then clean up redundant rules. We are still sorting out what to set the minsupp and minconf to.
zerules <- apriori(data, parameter = list(minlen = 2, supp = 0.2, conf = 0.15),
appearance = list(rhs = c("income_bracket=small", "income_bracket=large"),
default = "lhs"), control = list(verbose = F))
length(zerules)
#> [1] 91
redundant <- is.redundant(zerules)
zerules.pruned <- zerules[redundant == FALSE]
rulesorted <- sort(zerules.pruned, by = "lift", decreasing = TRUE)
length(rulesorted)
#> [1] 25
Next, let us inspect the rules, and examine their quality
(quality(rulesorted))
inspectDT(rulesorted, caption = "Association Rules")
plot(rulesorted, method = "scatterplot", measure = c("support", "lift"), shading = "confidence",
engine = "htmlwidget")
Scatter Plot
This plot shows us each rule, with x axis as support and Y axis as lift. They are colored by confidence. This makes a nice little view of which rules were made from the most data. These may not always be the most interesting rules, but they are definitely the strongest ones. From looking at this, we can note a few strong, interesting associations.
Females are associated with low income bracket
Not being married is strongly associated with a low income bracket
High school diploma only is associated with low income
Private/housekeeping/server jobs are associated with low income
No family is also associated with low income
These strong rules are by no means earth-shattering, but they do tell us that we are looking in the right direction, as they agree with the (sometimes unfortunate) realities of the world we live in.
plot(rulesorted, method = "graph", measure = "confidence", shading = "lift",
engine = "htmlwidget")
Balloon Plot
The balloon plot is able to filter and zoom, which gives the ability to look through each income group and see what rules are feeding the analysis. Just by seperating out the large income bracket, we begin to see that married-civ-spouce has one of the highest lift scores. The next two highest were white and Male. Which follows suit with our above research as well.
In the small income bracket, we see rules such as (Never-married, fulltime), (female, private/housekeeper/server), (Private/housekeeper/server, never-married). This would begins to tell us that if you’re female, never-married and work full time, then you’re chances are higher for making under 50k. It also says something negative
plot(rulesorted, method = "paracoord", measure = "confidence", shading = "lift",
control = list(reorder = T))
Parallel Plot
The width of the arrows represents support and the color represents confidence. We can see a few interesting things here. First of all, we have a striking level of confidence that being married gets you a large income, and then also interestingly we see that working as a private worker/housekeeper/server is a part of a lot of the rules related to being in the low income bracket (You can see this by looking where all the rules converge on LHS order 1).
This plot is often used to discuss the relationship between the order of the rules and their support/confidence.
plot(rulesorted, method = "two-key plot", measure = "lift", shading = "confidence",
engine = "htmlwidget")
Two Key Plot
We can find several interesting bits of information here. First, we see that we only produced one rule of order 4, a few rules of order 3, and a bunch of order 2. We also note that the area with high support and confidence (upper right, a bit of a sweet spot), is dominated by the shorter rules.
This is a grouped matrix plot, which in this case is not interpretable or useful, but we will show it anyways
plot(rulesorted, method = "grouped", measure = c("support", "lift"), shading = "confidence")
Grouped Plot
After our first attempt at mining for rules, we’re going to relax our support to 0.1 and increase our confidence to 98% see what we get. Aftewards, we’ll target small and large incomes individually in order to see if we can isolate the itemsets that are going to be of highest influence.
rule2 <- apriori(data, parameter = list(minlen = 2, supp = 0.1, conf = 0.98),
appearance = list(rhs = c("income_bracket=small", "income_bracket=large"),
default = "lhs"), control = list(verbose = F))
length(rule2)
#> [1] 50
redundant <- is.redundant(rule2)
rulep <- rule2[redundant == FALSE]
rulesorted2 <- sort(rulep, by = "lift", decreasing = TRUE)
length(rulesorted2)
#> [1] 36
We can see that this time around we’ve generated 36 rule’s. We’ll put them through the same inspection and plots as previously to see if any other rules stand out or if we’re seeing more of the same.
head(quality(rulesorted2))
inspectDT(rulesorted2, caption = "Alternate Association Rules")
plot(rulesorted2, method = "scatterplot", measure = c("support", "lift"), shading = "confidence",
engine = "htmlwidget")
plot(rulesorted2, method = "graph", measure = "confidence", shading = "lift",
engine = "htmlwidget")
plot(rulesorted2, method = "two-key plot", measure = "confidence", shading = "lift",
engine = "htmlwidget")
plot(rulesorted2, method = "grouped", measure = c("support", "lift"), shading = "confidence")
Now that we’ve scanned a couple of different variations of apriori and generated rules accordingly. Lets split up the targeting in the rhs section to only target one of the income brackets at a time. First we’ll start with low income.
rule_small <- apriori(data, parameter = list(minlen = 2, supp = 0.12, conf = 0.95),
appearance = list(rhs = c("income_bracket=small"), default = "lhs"), control = list(verbose = F))
length(rule_small)
#> [1] 45
redundant <- is.redundant(rule_small)
rulep <- rule_small[redundant == FALSE]
rulesorted_small <- sort(rulep, by = "lift", decreasing = TRUE)
length(rulesorted_small)
#> [1] 25
head(quality(rulesorted_small))
inspectDT(rulesorted_small)
It looks like we’ve generated 25 rules for the low income bracket. Lets take a look at the scatter plot and find our releveant itemsets/rules
plot(rulesorted_small, method = "scatterplot", measure = c("confidence", "support"),
shading = "lift", engine = "htmlwidget")
We will discuss the results from small income targeting in the conclusion section.
Our next step is to now target the large income bracket and see what rules shake out
rule_large <- apriori(data, parameter = list(minlen = 2, supp = 0.05, conf = 0.55),
appearance = list(rhs = c("income_bracket=large"), default = "lhs"), control = list(verbose = F))
length(rule_large)
#> [1] 53
redundant <- is.redundant(rule_large)
rulep <- rule_large[redundant == FALSE]
rulesorted_large <- sort(rulep, by = "lift", decreasing = TRUE)
length(rulesorted_large)
#> [1] 15
head(quality(rulesorted_large))
inspectDT(rulesorted_large)
It looks like we’ve generated 25 rules for the low income bracket. Lets take a look at the scatter plot and find our releveant itemsets/rules
plot(rulesorted_large, method = "scatterplot", measure = c("confidence", "support"),
shading = "lift", engine = "htmlwidget")
We will discuss the results from large income targeting in the conclusion section.
Interesting large Income rules:
Not being married is associated with a low income. Out of our 25 best rules, 16 of them involved not being married, and our five highest lift rules all involved never married. This is clearly not a coincidence, however what this means and the cause of this is not so clear. We propose a few hypotheses about this:
Married people, motivated by love, work harder and make more money
Shallow society deems people without money harder to marry
People who are married got married because they are more social, and thus they are given more raises, perform better in interviews, etc. In contrast, people who were not sociable enough to meet someone may not perform as well in interviews etc., giving them fewer opportunities to get a job
Having just a diploma from high school does not get you out of the lower income bracket. This is unsurprising, as today you need in general a minimum of a college education to start making good money (and even that is difficult). 5 out of our 25 rules involved highest education level = high school diploma
12 out of 25 rules involved being young. This makes perfect sense, as entry level jobs which young people get do not pay very well.
8 out of 25 rules involved being in the private /housekeeping/server field. This makes sense, as these workers are severly underpaid
4 out of 25 rules involved having a child. If you have a child, you should be spending more time with them, and maybe focusing a little less on your career. Therefore, you will probably earn less. This pattern makes sense.
The rules for our small income target have a support of 12% and confidence of 95%. This means that (never-married, Diploma, Young, Private/housekeeper/server, and child) are observed in 12% of the dataset. We therefore believe that 95% of the time any of the attributes (never-married, Diploma, Young, Private/housekeeper/server, and child) are present, it will result in a small income.
Interesting large Income rules:
Thankfully, opposite to the small income analysis, large income targeting shows that being married greatly influences your income as it shows up in 9 of the 15 rules present. Meaning again that by working as a team, you get more out of life. So go find yourself a husband/wife.
The next highest frequent itemset is hard-working that shows up in 8 of the 15 rules. Which makes perfect sense that the more time you put into a job, the likelier you are for promotion and advancement within an organization.
Education = Bachelors , native_country = United-States, relationship = husband are present in 5 out of the 15 rules.
The rules for our large income target have a support of 05% and confidence of 55%. This means that (married, hard-working, Bachelors, United-States, Husband) are observed in 5% of the dataset. We therefore believe that 55% of the time any of the attributes (married, hard-working, Bachelors, United-States, Husband) are present, it will result in a large income.
Be critical of your performance and tell the reader how you current model might be usable by other parties. Did you achieve your goals? If not, can you reign in the utility of your modeling?
• How useful is your model for interested parties (i.e., the companies or organizations that might want to use it)? • How would your deploy your model for interested parties? • What other data should be collected? • How often would the model need to be updated, etc.?